We, data scientists, work with data and endeavor to make sensible decisions based on data. We analyze data by first understanding the data and then extracting useful information to assist decision making.
One without rich knowledge in medicine and years of experience in clinical practice cannot be a doctor, not to mention a good doctor. Same applies to a data scientist, we need to accumulate experience in dealing with data by hands-on experience. Deciding on what data to collect and what we are looking for from the data are is very important first step toward every successful study. This requires you to equip yourself with a toolset, i.e. R packages/functions, in manipulating, summarizing and visualizing data, which is the cornerstone of our journey towards a great data scientist.
Useful packages:
We will focus on four specific packages dplyr, ggplot, purrr and data.table. dplyr, ggplot, and purrr are part of a broader library called tidyverse which consolidates commonly-used data science packages. See more here: https://www.tidyverse.org/.
dplyr can be used for data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges. Cheatsheetggplot is used for creating Graphics or Plots. Cheatsheetpurrr is a functional programming toolkit that helps you replace for loops to improve readability. Cheatsheetdata.table is an R package that provides an enhanced version of data.frames. Its speed makes it a preferable package to manipulate BIG data. CheatsheetIn addition to tidyverse and data.table, packages skimr, nycflights13, gapminder, ggthemes, ggpubr and plotly will be used in the tutorial. We use the p_load function of pacman, to load all the packages at once. It will first check whether packages are installed. If not, it will install the yet to be installed automatically and then load all the packages.
dplyrBefore we dig into details about dplyr, let us first find out information about dplyr. We may always want to Google first: r:dplyr. Alternatively, most R-packages provide vignettes which explain the package, usage, etc. We may use vignette() or browseVigenttes("package name") to get the vignette.
browseVignettes("dplyr") # web version
vignette("dplyr") # version inside RstudioThere are several documents in the vignette for dplyr. Let us explore Introduction to dplyr where it includes dplyr.rmd, dplyr.html and dplyr.R. (The very first draft of the Advanced R Tutorial contains some materials from dplyr.rmd).
We next start to highlight dplyr.
Below is a table of popular dplyr commands. We will go through each one.
| dplyr Command | SQL equivalent | Action |
|---|---|---|
filter() |
WHERE | Limit based on condition |
select() |
SELECT | Choose only certain variables |
distinct() |
DISTINCT | De-duplicate result-set |
arrange() |
ORDER BY | Order results |
rename() |
SELECT | Rename variables |
mutate() |
SELECT | Create new variable |
group_by() |
GROUP BY | Group rows |
summarise() |
SELECT | Create new variable in grouped setting |
join() |
JOIN | Join tables |
You can also view the full cheat sheet here: DPLYR Cheat Sheet
To illustrate these commands, we will use a dataset called flights in the nycflights13 package that contains 336,776 flights departed from New York City in 2013. dplyr allows you to gather insight from a dataset without altering the original dataset. It is considered best practice not to alter the original dataset. For example, in this case, we will never overwrite the existing dataset ‘flights’. We will first take a look at the the dataset and summary statistics.
library(nycflights13) # since we have load nycflights13 in the first r chunk, we do not need to library it again.
names(flights) # find the variables
dim(flights) # size of the data
str(flights) # structure of the data
summary(flights) # may spot peculiar/unusual...Notice that there are many missing values in the dataset. This raises alerts to us about the data. Since our focus in this lecture is about various R packages, we will not get into how to deal with missing values at the moment. BUT IT IS A VERY IMPORTANT ISSUE we always need to address in data analyses.
We could use the package skimr to get a better organized summary.
# summary(flights)
skim(flights) # it doesn't seem to report missing valuesThe %>% command is called a pipe. This means that the result of the code before %>% is sent, or “piped”, to the one after %>%. Piping is a powerful tool for clearly expressing a sequence of multiple operations, as we will see shortly.
Here’s an example of using piping to first square 1:5 and then sum them up. . indicates the previous piping output. The following is equivalent to sum((1:5)^2).
1:5 %>%
.^2 %>%
sum## [1] 55
The filter command will only display the subset of your dataset that match a certain condition. This command will only show flights on Jan 1st, 2013.
This code is the same as doing filter(flights, month == 1 & day == 1) since the %>% command passes the flights dataframe to the filter command. & means the AND operation, | means the OR operation.
flights %>%
filter(month == 1 & day == 1)## # A tibble: 842 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 832 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
It is important to remember that this command does not alter the original flight dataset. If we want to save this subset as its own object, we run the following. Remember the <- is the assignment operator in R.
filteredFlight <- flights %>%
filter(month == 1 & (day == 1 | day == 2))
filteredFlight## # A tibble: 1,785 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 1,775 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# View(filteredFlight)Multiple conditions can be included in a filter command. The command below shows any flights from Jan through June to PHL or SLC airports.
flights %>%
filter(dest %in% c("PHL","SLC") & month <= 6)## # A tibble: 2,116 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 655 655 0 1021 1030
## 2 2013 1 1 908 915 -7 1004 1033
## 3 2013 1 1 1047 1050 -3 1401 1410
## 4 2013 1 1 1245 1245 0 1616 1615
## 5 2013 1 1 1323 1300 23 1651 1608
## 6 2013 1 1 1543 1550 -7 1933 1925
## 7 2013 1 1 1600 1610 -10 1712 1729
## 8 2013 1 1 1909 1912 -3 2239 2237
## 9 2013 1 1 1915 1920 -5 2238 2257
## 10 2013 1 1 2000 2000 0 2054 2110
## # … with 2,106 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Select will only return columns that are listed. In this case, the resulting dataset will consist of the Origin, Destination, and Carrier of flights that were destined for PHL or SLC in the first 6 months of the year. Remember, the pipe command sends the result of the current line to the next line. In this case, the filtered dataset is then piped into the select command.
flights %>%
filter(dest %in% c("PHL","SLC") & month <= 6) %>%
select(origin, dest, carrier)## # A tibble: 2,116 x 3
## origin dest carrier
## <chr> <chr> <chr>
## 1 JFK SLC DL
## 2 LGA PHL US
## 3 JFK SLC DL
## 4 JFK SLC DL
## 5 EWR SLC DL
## 6 JFK SLC DL
## 7 JFK PHL 9E
## 8 JFK SLC B6
## 9 JFK SLC DL
## 10 JFK PHL 9E
## # … with 2,106 more rows
On the contrary, we can use - to deselect columns. If we want to drop year, month and day, we just need to prefix - to each column name.
flights %>%
filter(dest %in% c("PHL","SLC") & month <= 6) %>%
select(-year, -month, -day)## # A tibble: 2,116 x 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 655 655 0 1021 1030 -9 DL
## 2 908 915 -7 1004 1033 -29 US
## 3 1047 1050 -3 1401 1410 -9 DL
## 4 1245 1245 0 1616 1615 1 DL
## 5 1323 1300 23 1651 1608 43 DL
## 6 1543 1550 -7 1933 1925 8 DL
## 7 1600 1610 -10 1712 1729 -17 9E
## 8 1909 1912 -3 2239 2237 2 B6
## 9 1915 1920 -5 2238 2257 -19 DL
## 10 2000 2000 0 2054 2110 -16 9E
## # … with 2,106 more rows, and 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Distinct will remove any duplicate rows from the given dataset. Notice in the previous command, it returned a subset with 2116 rows, but with distinct, we can see that only 8 carriers flew to PHL or SLC in the first half of the year.
flights %>%
filter(dest %in% c("PHL","SLC") & month <= 6) %>%
select(origin, dest, carrier) %>%
distinct()## # A tibble: 8 x 3
## origin dest carrier
## <chr> <chr> <chr>
## 1 JFK SLC DL
## 2 LGA PHL US
## 3 EWR SLC DL
## 4 JFK PHL 9E
## 5 JFK SLC B6
## 6 EWR PHL EV
## 7 JFK PHL US
## 8 JFK PHL DL
Arrange puts your data into alphabetical order. In this case the order is first by origin, then descending alphabetical order of the destination, then alphabetical order of carrier.
flights %>%
filter(dest %in% c("PHL","SLC") & month <= 6) %>%
select(origin, dest, carrier) %>%
distinct() %>%
arrange(origin, desc(dest), carrier)## # A tibble: 8 x 3
## origin dest carrier
## <chr> <chr> <chr>
## 1 EWR SLC DL
## 2 EWR PHL EV
## 3 JFK SLC B6
## 4 JFK SLC DL
## 5 JFK PHL 9E
## 6 JFK PHL DL
## 7 JFK PHL US
## 8 LGA PHL US
The Rename function can be used to easily rename a column Header. Here, we rename carrier to airline. The syntax is as follows. rename(<newname> = <oldname>)
flights %>%
filter(dest %in% c("PHL","SLC") & month <= 6) %>%
select(origin, dest, carrier) %>%
distinct() %>%
arrange(origin, desc(dest), carrier) %>%
rename(airline = carrier)## # A tibble: 8 x 3
## origin dest airline
## <chr> <chr> <chr>
## 1 EWR SLC DL
## 2 EWR PHL EV
## 3 JFK SLC B6
## 4 JFK SLC DL
## 5 JFK PHL 9E
## 6 JFK PHL DL
## 7 JFK PHL US
## 8 LGA PHL US
Mutate is used to create new columns based on current ones. This feature is very useful. Here, we create three new variables “gain”, “speed”, and “gain_per_hour”. Notice how “gain_per_hour” uses the column “gain”, which was created in the same mutate statement.
flights %>%
mutate(gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
gain_per_hour = gain / (air_time / 60)) %>%
select(dep_delay, arr_delay, gain, distance, distance, air_time, speed, gain_per_hour)## # A tibble: 336,776 x 7
## dep_delay arr_delay gain distance air_time speed gain_per_hour
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 11 -9 1400 227 370. -2.38
## 2 4 20 -16 1416 227 374. -4.23
## 3 2 33 -31 1089 160 408. -11.6
## 4 -1 -18 17 1576 183 517. 5.57
## 5 -6 -25 19 762 116 394. 9.83
## 6 -4 12 -16 719 150 288. -6.4
## 7 -5 19 -24 1065 158 404. -9.11
## 8 -3 -14 11 229 53 259. 12.5
## 9 -3 -8 5 944 140 405. 2.14
## 10 -2 8 -10 733 138 319. -4.35
## # … with 336,766 more rows
Use transmute if you only want to keep the new variables.
flights %>%
transmute(gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
gain_per_hour = gain / (air_time / 60)) ## # A tibble: 336,776 x 3
## gain speed gain_per_hour
## <dbl> <dbl> <dbl>
## 1 -9 370. -2.38
## 2 -16 374. -4.23
## 3 -31 408. -11.6
## 4 17 517. 5.57
## 5 19 394. 9.83
## 6 -16 288. -6.4
## 7 -24 404. -9.11
## 8 11 259. 12.5
## 9 5 405. 2.14
## 10 -10 319. -4.35
## # … with 336,766 more rows
Group by and summarise often go together to get summary statistics for each group. Reorganize dataframe by rows according to the column that is grouped by; summarise then gives statistics of that group. Here, the origin column had three categories, EWR, JFK, & LGA. The group_by(origin) command organizes the dataframe by the three origins. Then summarise() is used to get metrics related to each origin.
From this table, we can see that EWR had the most flights with 120835, and LGA had the lowest avg delay at 10.34.
flights %>%
group_by(origin) %>%
summarise(num_of_flights = n(),
avg_delay = mean(dep_delay, na.rm = TRUE)) # na.rm removes any NA data## # A tibble: 3 x 3
## origin num_of_flights avg_delay
## <chr> <int> <dbl>
## 1 EWR 120835 15.1
## 2 JFK 111279 12.1
## 3 LGA 104662 10.3
group_by can also take expressions. The following returns the number of flights that started late AND arrived late; started early AND arrived early (or on time).
flights %>%
filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
group_by(dep_arr_late = dep_delay > 0 & arr_delay > 0) %>%
summarise(num_of_flights = n())## # A tibble: 2 x 2
## dep_arr_late num_of_flights
## <lgl> <int>
## 1 FALSE 235043
## 2 TRUE 92303
Summarise has a number of other functions that can be used within it. For example, the commonly-used mean(), sum(), max(), min(). Sometimes we might be interested in value in specific position in each group, such that first() and last() retrieves the first and last value respectively. nth(, index) retrieves the nth value. You can combine arrange() and nth() to get the n-th largest value.
flights %>%
distinct(origin, dest, distance) %>%
group_by(origin) %>%
arrange(origin, -distance) %>%
summarise(thrid_farthest = nth(distance, 3))## # A tibble: 3 x 2
## origin thrid_farthest
## <chr> <dbl>
## 1 EWR 2565
## 2 JFK 2576
## 3 LGA 1416
n_distinct(dest) returns the number of distinct destinations. From this table we can see that EWR has flights to the largest number of destinations (56). We can also see LGA flights have a lower average distance than those of EWR & JFK.
flights %>%
group_by(origin) %>%
summarise(destinations = n_distinct(dest),
avg_distance = mean(distance, na.rm = TRUE))## # A tibble: 3 x 3
## origin destinations avg_distance
## <chr> <int> <dbl>
## 1 EWR 86 1057.
## 2 JFK 70 1266.
## 3 LGA 68 780.
group_by also takes multiple variables. For example, say we want to know what the farthest flight to leave NYC is. To answer this, we can group by origin and destination, summarise the max distance for each pair, and then order by the maximum distance value we created. It is now easy to see that the max distance was from EWR or JFK to HNL.
flights %>%
group_by(origin, dest) %>%
summarise(max_distance = max(distance)) %>%
arrange(desc(max_distance))## `summarise()` has grouped output by 'origin'. You can override using the `.groups` argument.
## # A tibble: 224 x 3
## # Groups: origin [3]
## origin dest max_distance
## <chr> <chr> <dbl>
## 1 JFK HNL 4983
## 2 EWR HNL 4963
## 3 EWR ANC 3370
## 4 JFK SFO 2586
## 5 JFK OAK 2576
## 6 JFK SJC 2569
## 7 EWR SFO 2565
## 8 JFK SMF 2521
## 9 JFK LAX 2475
## 10 JFK BUR 2465
## # … with 214 more rows
In addition to summarise(), group_by() can be useful for other proposes. For example, get the summary statistics of arr_delay by carrier.
flights %>%
# filter(carrier %in% c("AA", "UA")) %>%
group_by(carrier) %>%
skim(arr_delay)Another example is to find out some of the worst flights of AA and UA.
flights %>%
filter(carrier %in% c("AA", "UA")) %>%
group_by(carrier, month) %>%
filter(rank(-arr_delay) < 3) %>%
arrange(carrier, month, -arr_delay)## # A tibble: 48 x 19
## # Groups: carrier, month [24]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 2 1607 1030 337 2003 1355
## 2 2013 1 1 2205 1720 285 46 2040
## 3 2013 2 11 1416 810 366 1845 1315
## 4 2013 2 14 1447 920 327 1735 1240
## 5 2013 3 8 1933 1345 348 2231 1705
## 6 2013 3 1 1528 920 368 1738 1240
## 7 2013 4 19 606 1725 761 923 2020
## 8 2013 4 19 617 1700 797 858 1955
## 9 2013 5 19 713 1700 853 1007 1955
## 10 2013 5 16 2233 1340 533 59 1640
## # … with 38 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
When you have two tables, you might want to merge them together. For example, airlines data is a table of carries’ symbol and their corresponding full name.
airlines## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
Now we want to get the total number of flights of each carriers with their full name attached. First, we will use group_by() and summarise() to get the total number of flights for each carriers. Then, we use a join() function to join the result table with airlines. The by argument tells join() function which variable is the key to join. In our example, we want to join two tables by the carrier variables, so carrier is our key variable to join.
flights %>%
group_by(carrier) %>%
summarise(num_of_flights = n()) %>%
left_join(airlines, by = "carrier") %>%
arrange(-num_of_flights)## # A tibble: 16 x 3
## carrier num_of_flights name
## <chr> <int> <chr>
## 1 UA 58665 United Air Lines Inc.
## 2 B6 54635 JetBlue Airways
## 3 EV 54173 ExpressJet Airlines Inc.
## 4 DL 48110 Delta Air Lines Inc.
## 5 AA 32729 American Airlines Inc.
## 6 MQ 26397 Envoy Air
## 7 US 20536 US Airways Inc.
## 8 9E 18460 Endeavor Air Inc.
## 9 WN 12275 Southwest Airlines Co.
## 10 VX 5162 Virgin America
## 11 FL 3260 AirTran Airways Corporation
## 12 AS 714 Alaska Airlines Inc.
## 13 F9 685 Frontier Airlines Inc.
## 14 YV 601 Mesa Airlines Inc.
## 15 HA 342 Hawaiian Airlines Inc.
## 16 OO 32 SkyWest Airlines Inc.
There are different types of joins: left_join(), right_join(), inner_join(), and full_join().
left_join(): keep all observations in x, which is flights in our example. If there’s no corresponding key value in y table, the joined table will have NA.right_join(): keep all observations in y, which is airlines in our exampleinner_join(): keep match pairs only and drop rows that do NOT have keys in both tables.full_join(): keeps all observations in x and y.Here’s a Venn diagram to depict them.
To compare different joins, let’s create a look up table that includes United Airlines, American Airlines, and Lufthansa. We compare the join results using inner_join(), left_join() and right_join().
my_airlines <- data.frame(abbr = c("AA", "LH", "UA"),
name = c("American Airline", "Lufthansa", "United Airline"))Since only AA and UA are in both tables, the inner-joined table will only keep AA and UA and drop LH.
# inner_join
flights %>%
group_by(carrier) %>%
summarise(num_of_flights = n()) %>%
inner_join(my_airlines, by = c("carrier" = "abbr")) %>%
arrange(-num_of_flights)## # A tibble: 2 x 3
## carrier num_of_flights name
## <chr> <int> <chr>
## 1 UA 58665 United Airline
## 2 AA 32729 American Airline
Left-joined table will keep all observations in x (flights). So for carries that are not in my_airlines, they have NA name.
# left_join
flights %>%
group_by(carrier) %>%
summarise(num_of_flights = n()) %>%
left_join(my_airlines, by = c("carrier" = "abbr")) %>%
arrange(-num_of_flights)## # A tibble: 16 x 3
## carrier num_of_flights name
## <chr> <int> <chr>
## 1 UA 58665 United Airline
## 2 B6 54635 <NA>
## 3 EV 54173 <NA>
## 4 DL 48110 <NA>
## 5 AA 32729 American Airline
## 6 MQ 26397 <NA>
## 7 US 20536 <NA>
## 8 9E 18460 <NA>
## 9 WN 12275 <NA>
## 10 VX 5162 <NA>
## 11 FL 3260 <NA>
## 12 AS 714 <NA>
## 13 F9 685 <NA>
## 14 YV 601 <NA>
## 15 HA 342 <NA>
## 16 OO 32 <NA>
Right-joined table will keep all observations in y (my_airlines), so LH has NA number of flights.
flights %>%
group_by(carrier) %>%
summarise(num_of_flights = n()) %>%
right_join(my_airlines, by = c("carrier" = "abbr")) %>%
arrange(-num_of_flights)## # A tibble: 3 x 3
## carrier num_of_flights name
## <chr> <int> <chr>
## 1 UA 58665 United Airline
## 2 AA 32729 American Airline
## 3 LH NA Lufthansa
More on joining tables, see here.
Get the cumulative amounts of flights of each airline by month. Hint: use the cumulative sum function cumsum().
ggplotggplot is THE tool for exploratory data analysis. Exploratory data analysis is a way to develop an understanding of your data and we cannot stress more about understanding your data as data scientists. ggplot is one of the best tools for plotting graphs among other software/libraries (Matlab, Python’s matplotlib). Our goal here is to let graphs tell the number using ggplot.
gg means the grammar of graphics. The basic idea of ggplot is to independently specify building blocks and combine them to create just about any kind of graphical display you want. Building blocks of a graph include:
In ggplot land aesthetic means “something you can see”. Examples include:
Gapminder is an non-profit organization. The goal is to give correct view about the world. The package gapminder contains a data set also named gapminder. But this dataset is about world life expectancy statistics. Let’s do a quick summary.
summary(gapminder)
# skim(gapminder)| Data | Plots | Geom (ggplot command) |
|---|---|---|
| One Continuous | Histogram | geom_histogram |
| One Continuous + One Categorical | Boxplot | geom_boxplot |
| Two Continuous | Scatter Plot | geom_point |
| Three Continuous | Scatter Plot + Size | geom_point w/ size aesthetic |
| Two Continuous + One Categorical | Scatter Plot + Color | geom_point w/ color aesthetic |
| Categorical with reasonable number of levels | Faceting!! | facet_wrap() |
Note: Time is always the x-axis.
There are many more geom types, but we will focus on the ones listed in the table above.
Here is an extremely useful cheatsheet that shows all of ggplots functions and how to use them.
The following shows the histogram of life expectancy in 2007. Life expectancy is a continuous variable, so we use geom_histogram().
Note how the %>% or “piping” also works with ggplot. If you are not piping in a dataframe, the first input to ggplot should be your dataframe. For example, the command would become ggplot(gapminder, aes(x = lifeExP)) + geom_histogram(binwidth = 2)
hist(gapminder$lifeExp)gapminder %>%
ggplot(aes(x = lifeExp)) +
geom_histogram(binwidth = 2) What has caught your eyes right away?
Life expectancy in other years:
summary(gapminder)## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
gapminder %>% filter(year == 2005) %>% # 2005 > no histogram, why not?
ggplot(aes(x = lifeExp)) +
geom_histogram(binwidth = 2)# see what year we have
unique(gapminder$year)## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
gapminder %>%
filter(year == 1952) %>% # change to other years
ggplot(aes(x = lifeExp)) +
geom_histogram(binwidth = 2)gapminder %>%
filter(year == 2007) %>% # change to other years
ggplot(aes(x = lifeExp)) +
geom_histogram(binwidth = 2)# put two together using facet_wrap()
gapminder %>%
filter(year %in% c(1952, 2007)) %>%
ggplot(aes(x = lifeExp)) +
geom_histogram(binwidth = 2) +
facet_wrap(~year)Now, we want to show lifeExp broken down by continent. Continent is a categorical variable, also called factors in R. For this, we use the geom_boxplot() command.
gapminder %>%
filter(year == 2007) %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot()Using geom_point() we create a scatter plot of our two continuous variables, gdpPercap and LifeExp.
plot(gapminder$gdpPercap, gapminder$lifeExp, pch=16)gapminder %>%
filter(year == 2007) %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point()Some relationships will look better on different scales, and ggplot allows you to change scales very quickly. Here we log the x-axis, with scale_x_log10(), which makes the relationship between these two variables much clearer. Use size and shape argument in geom_point() to adjust the size and the shape. The options of shape is indexed as follows.
Credit to https://blog.albertkuo.me/post/point-shape-options-in-ggplot/
gapminder %>%
filter(year == 2007) %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(size = 1, shape = 5) +
scale_x_log10()If we want to show three continuous variables at the same time, we can use the size aesthetic in ggplot. This will alter the size of the point by the value in the pop column of the gapminder data frame.
gapminder %>%
filter(year == 2007) %>%
ggplot(aes(x = gdpPercap, y = lifeExp, size = pop)) +
geom_point() +
scale_x_log10()To show more insight into this graph, we can show each point by which continent it is from. Adding the color Aesthetic allows us to show a categorical variable, continent, as each point is colored by what continent it is from.
gapminder %>%
filter(year == 2007) %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(size = pop, color = continent)) +
scale_x_log10()ggplot can also quickly add a linear model to a graph. There are also other models geom_smooth can do (“lm”, “glm”, “gam”, “loess”, “rlm”). If you leaving it blank it will automatically choose one for you, but that is not recommended.
gapminder %>%
filter(year == 2007) %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
geom_smooth(se = TRUE) To add the linear model line, we add
geom_smooth(method = 'lm', se = TRUE) to the command. se = TRUE tells it to plot the standard error ranges on the graph.
gapminder %>%
filter(year == 2007) %>%
ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
geom_point(aes(col = continent)) +
geom_smooth(method = 'lm', formula = 'y~x', se = TRUE) Note that
aes() in ggplot() is a global setting, it will affect all geometric objects. It is the best practice to put aesthetic setting for each geometric objects. See the difference in the next plot.
gapminder %>%
filter(year == 2007) %>%
ggplot(aes(x = log(gdpPercap), y = lifeExp, col = continent)) +
geom_point() +
geom_smooth(method = 'lm', formula = 'y~x', se = TRUE)## Warning in qt((1 - level)/2, df): NaNs produced
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning -
## Inf
So far we have been plotting a cross-sectional plot. We can use geom_line() with x-axis as year and y-axis as life expectancy. We plot the trend of life expectancy of US.
gapminder %>%
filter(country == "United States") %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line()Now we want to plot the life expectancy trend of each country. We call it the spaghetti plot. We need to tell geom_line() to group the line by country. We use different colors to separate continents.
gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line(aes(group = country, col = continent))It is a bit hard to see the pattern. We can add the trend of mean of each continent in our plot. In addition to the trend, we need to call another geom_line() to plot the mean. To make the mean trend stand out, we set alpha for the transparency of the spaghetti.
gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line(aes(group = country, col = continent), alpha = .2) +
geom_line(data = gapminder %>%
group_by(continent, year) %>%
summarise(mean = mean(lifeExp)),
aes(x = year, y = mean, group = continent, col = continent),
size = 1)## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
Another trend plot is that we want to see how each group changes. We can use barplot with fill by each group. There are three position options for geom_bar():
gapminder %>%
ggplot(aes(x = year, y = pop, fill = continent)) +
geom_bar(position = "stack", stat = "identity") gapminder %>%
ggplot(aes(x = year, y = pop, fill = continent)) +
geom_bar(position = "fill", stat = "identity") Instead of changing the color of points on the graph by continent, you can also create a different graph for each continent by ‘faceting’. Depending on the number of factors and your dataset, faceting may look better than just changing colors. To do this we add the facet_wrap(~ continent) command.
gapminder %>%
filter(year == 2007) %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
scale_x_log10() +
facet_wrap(~continent)You can facet with any geom type. For example, our spaghetti plot.
gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line(aes(group = country, col = continent), alpha = .2) +
geom_line(data = gapminder %>%
group_by(continent, year) %>%
summarise(mean = mean(lifeExp)),
aes(x = year, y = mean, group = continent, col = continent),
size = 1) +
facet_wrap(~ continent)## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
Another example, linear model of each continent. We can use scales = "free" in facet_wrap() to free the scales of x, y axes. It is generally NOT recommend to do so since it can be misleading.
gapminder %>%
filter(year == 2007) %>%
ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
geom_point() +
geom_smooth(method = 'lm', formula = 'y~x', se = TRUE)+
facet_wrap(~ continent)## Warning in qt((1 - level)/2, df): NaNs produced
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning -
## Inf
ggpubr::ggarange()To put multiple plots together, use ggarrange() from ggpubr package.
p1 <- gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line(aes(group = country, col = continent), alpha = .2) +
geom_line(data = gapminder %>%
group_by(continent, year) %>%
summarise(mean = mean(lifeExp)),
aes(x = year, y = mean, group = continent, col = continent),
size = 1)## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
p2 <- gapminder %>%
ggplot(aes(x = year, y = pop, fill = continent)) +
geom_bar(position = "stack", stat = "identity")
# put p1 and p2 together
p <- ggpubr::ggarrange(p1, p2, common.legend = T, legend = "bottom", ncol = 2)
annotate_figure(p,
top = text_grob("Life expectancy and Population",
color = "black", face = "bold", size = 14))ggsave()Use ggsave() to save plots to png, jpeg or pdf. It plots the previous plot until we specify which plot to save.
ggsave(filename = "fig/life_exp_population.pdf", width = 8, height = 5)plotly packagePlotly is an interactive web-based library. It is well-integrated with ggplot2 via ggplotly(). You can feed in any ggplots to ggplotly(), it will output an interactive plot. Note that since it is web-based, it only works when you knit to html.
p <- gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line(aes(group = country, col = continent), alpha = .2) +
geom_line(data = gapminder %>%
group_by(continent, year) %>%
summarise(mean = mean(lifeExp)),
aes(x = year, y = mean, group = continent, col = continent),
size = 1)## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
ggplotly(p)One thing worth mentioning is animation with plotly. Use frame argument.
p <- ggplot(gapminder, aes(gdpPercap, lifeExp, color = continent)) +
geom_point(aes(size = pop, frame = year, ids = country)) +
scale_x_log10()## Warning: Ignoring unknown aesthetics: frame, ids
ggplotly(p)ggplot makes it very easy to create very nice graphs, using whats called a theme. There are many default themes or you can make your own. Almost everything is customizable on ggplot, including sizes, colors, font types, etc. Below is a example of building a theme.
To use a theme, it is simply added on the end of your ggplot string of commands. You can also add titles with ggtitle(), change labels with xlab() and ylab() command.
We might want to change the label or color of the legend. Scale family will come in handy.
We can use scale_color_brewer() or scale_fill_brewer() with palette argument depends on you are using col or fill aesthetics. Here are the available palettes:
gapminder %>%
filter(year == 2007) %>%
ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
geom_point(aes(col = continent)) +
scale_color_brewer(palette = "Spectral",
labels = c("AFRICA", "AMERICA", "ASIA", "EUROPE", "OCEANIA")) +
labs(colour = "CONTINENT")gapminder %>%
ggplot(aes(x = year, y = pop, fill = continent)) +
geom_bar(position = "stack", stat = "identity") +
scale_fill_brewer(palette = "Spectral", direction = -1)We can also scale by manual or scale by gradient for continuous variable.
ggthemesYou can also use pre-existing themes. ggthemes recreates popular themes from websites and magazines. See its gallery here. Here is one based on The Economist magazine.
apply and purrr::map familyIn basic R tutorial, we mentioned that R is a functional programming language. Functional programming languages write programs by applying and composing functions. It allows us to pass a function into a function.
For example, if we want to get the mean and sd for each column of a data frame, we can do it one by one, or by a for loop, or using the advantage of R being a functional programming language. (Remember that each column of one data frame is a list!)
gapminder_sub <- gapminder %>% select(lifeExp, pop)
# get mean
gapminder_sub_mean <- rep(NA, ncol(gapminder_sub))
for(i in seq_along(gapminder_sub)) {
gapminder_sub_mean[i] <- mean(gapminder_sub[[i]])
}
# get sd
gapminder_sub_sd <- rep(NA, ncol(gapminder_sub))
for(i in seq_along(gapminder_sub)) {
gapminder_sub_sd[i] <- sd(gapminder_sub[[i]])
}Then we might come up with more statistics to calculate for each column. Copying and pasting the for loop code again makes your code redundant and hard to read. And this practice is always prone to error and very inflexible. More importantly, as your code get complicated, it gets very hard to debug. What if we want to do the same thing to other data set?
R allows us to pass a function to function. Instead of writing three functions that have a similar structure with only different being mean() or sd(), we can pass a function as argument into a function.
col_summary <- function(df, fun) {
output <- rep(NA, ncol(df))
for(i in seq_along(df)) {
output[i] <- fun(df[[i]])
}
output
}
col_summary(gapminder_sub, mean)## [1] 5.947444e+01 2.960121e+07
We can even pass our self-defined function into col_summary().
log_mean <- function(x) {
mean(log(x))
}
col_summary(gapminder_sub, log_mean)## [1] 4.059839 15.766110
Apply a function to each element of a list/vector is so common that apply class function and purrr package are designed specifically for this purpose. They are implemented in C, which makes them a little faster.
lapply() and map()lapply() and map() takes a vector or a list and apply a function to each element and output a list. Function arguments (e.g. na.rm=T) follows function name.
lapply(gapminder_sub, mean, na.rm = T)## $lifeExp
## [1] 59.47444
##
## $pop
## [1] 29601212
map(gapminder_sub, mean, na.rm = T)## $lifeExp
## [1] 59.47444
##
## $pop
## [1] 29601212
sapply()/vapply() and map_*()sapply() and vapply() are wrappers around lapply() that simplifies the output from a list to vector/matrix. They basically do the same thing but vapply() is considered to be a safer choice because you have to specify the type of the output. map_*() is a simple alternative for vapply() by specifying the return type.
map_*() |
return |
|---|---|
map_lgl() |
logical vector |
map_int() |
integer vector |
map_dbl() |
double vector |
map_chr() |
character vector |
For example, we want to check the class of each column.
sapply(gapminder, class)## country continent year lifeExp pop gdpPercap
## "factor" "factor" "integer" "numeric" "integer" "numeric"
vapply(gapminder, class, character(1))## country continent year lifeExp pop gdpPercap
## "factor" "factor" "integer" "numeric" "integer" "numeric"
map_chr(gapminder, class)## country continent year lifeExp pop gdpPercap
## "factor" "factor" "integer" "numeric" "integer" "numeric"
vapply() and sapply() can output matrix while map_*() can only output vector.
vapply(gapminder_sub, summary, numeric(6))## lifeExp pop
## Min. 23.59900 60011
## 1st Qu. 48.19800 2793664
## Median 60.71250 7023596
## Mean 59.47444 29601212
## 3rd Qu. 70.84550 19585222
## Max. 82.60300 1318683096
We can write function within apply and map class function as well!
vapply(gapminder_sub, function(x) mean(log(x)), numeric(1))## lifeExp pop
## 4.059839 15.766110
map_dbl(gapminder_sub, function(x) mean(log(x)))## lifeExp pop
## 4.059839 15.766110
data.tableAs an enhanced version of data.frame, one can do a lot more than subsetting rows and selecting columns within the frame of data.table, i.e., within [ ... ]. The general form of data.table syntax is as shown below:
## DT[i, j, by]
## # R: i j by
## # SQL: where select | update group by
The way to read it (out loud) is as suggested:
Take DT, subset rows using i, then calculate j, grouped by by.
Let’s redo the work by dplyr above in the data.table way. This tutorial will only cover the basics of data.table. There are more advanced data.table operations that come in handy. See the data.table cheat sheet here. Comparisons between dplyr, data.table and pandas (a popular Python package) in terms of speed can be found here.
fread() and fwrite()fread() and fwrite() are similar to read.csv() and write.csv() for reading and writing data respectively but they are much faster, especially for BIG dataset.
data.tablesetDT transforms data.frame into data.table. Notice that since data.table is an enhanced version of data.frame, operations on data.frame are still available to data.table. We first make a copy of flights and transform it into data.table.
flights2 <- copy(flights)
setDT(flights2)
class(flights2)## [1] "data.table" "data.frame"
iTo show flights on Jan 1st, 2013:
dplyr:
flights %>% filter(month == 1 & day == 1)data.table:
flights2[month == 1 & day == 1]To get any flights from Jan through June to PHL or SLC airports:
dplyr:
flights %>% filter(dest %in% c("PHL","SLC") & month <= 6)data.table:
flights2[dest %in% c("PHL","SLC") & month <= 6]To get the first N row: dplyr:
flights[1:2,]data.table:
flights2[1:2]iTo sort column origin in ascending order, and then by dest in descending order: dplyr:
flights %>%
select(origin, dest, carrier) %>%
distinct() %>%
arrange(origin, desc(dest), carrier)data.table:
unique(flights2[order(origin, -dest), .(origin, dest, carrier)])jTo select Origin, Destination, and Carrier of flights: dplyr:
flights %>% select(origin, dest, carrier)data.table:
flights2[, list(origin, dest, carrier)]or
flights2[, .(origin, dest, carrier)]or you can also store the columns’ names into a variable and select them using .SDcols and .SD (Subset of Data.table). data.table also has a special syntax .. for selecting columns by a vector.
select_columns <- c("origin", "dest", "carrier")
flights2[, .SD, .SDcols = select_columns]
# flights2[, ..select_columns]When selecting only one column, data.table allows us to return the column as a vector as what the base $ does.
If
To select origin and return it as a vector: dplyr:
flights$origin
flights %>% pull(origin)data.table:
flights2[, origin]To rename dest to destination and carrier to airline:
dplyr:
flights %>%
select(origin, dest, carrier) %>%
rename(dest = destination, airline = carrier)data.table:
renames(flights2,
old = c("dest", "airline"),
new = c("destination", "carrier"))Notice that in dplyr, the syntax is rename(old_name = new_name) which be read as “rename old_name to new_name”.
jOther than selecting columns, j can also handle expressions. For example,
To compute how many trips had total delay < 0: base:
sum((flights$arr_delay + flights$dep_delay) < 0, na.rm = T)## [1] 188401
data.table:
flights2[, sum((arr_delay + dep_delay) < 0, na.rm = T)]## [1] 188401
.NThere are some special symbol in data.table, e.g. .N, .SD, .SDcols. .N holds the number of observations in the current group. Think of it as a counterpart of n() of dplyr.
To count the total number of flights: dplyr:
flights %>%
summarise(num_of_flights = n())## # A tibble: 1 x 1
## num_of_flights
## <int>
## 1 336776
data.table:
flights2[, .(num_of_flights = .N)]## num_of_flights
## 1: 336776
:=To create new columns as what mutate of dplyr does:
dplyr:
flights %>%
mutate(gain = dep_delay - arr_delay)data.table:
flights2[, gain := dep_delay - arr_delay]To remove gain: Use := NULL to remove columns.
flights2[, gain := NULL]## Warning in `[.data.table`(flights2, , `:=`(gain, NULL)): Column 'gain' does not
## exist to remove
bySimilar to group_by of dplyr, data.table applies the operations in j to groups specified in by.
To count the total number of flights from each origin and average delay time of each: dplyr:
flights %>%
group_by(origin) %>%
summarise(num_of_flights = n(),
avg_delay = mean(dep_delay, na.rm = TRUE))data.table:
flights2[, .(num_of_flights = .N,
avg_delay = mean(dep_delay, na.rm = TRUE)),
by = origin]by also takes expressions like group_by does.
To count the number of flights that started late but arrived early (or on time), started and arrived late etc.: dplyr:
flights %>%
filter(!is.na(dep_delay) & !is.na(arr_delay)) %>%
group_by(dep_delay > 0, arr_delay > 0) %>%
summarise(num_of_flights = n())data.table:
flights2[!is.na(dep_delay) & !is.na(arr_delay),
.N, by = .(dep_delay>0, arr_delay>0)]apply functiondata.table supports lapply() function to operation on columns. The following code get the total number of NA of each column. Remember .SDcols and .SD (Subset of Data.table) are for selecting columns.
vapply(flights, function(x) sum(is.na(x)), numeric(1))## year month day dep_time sched_dep_time
## 0 0 0 8255 0
## dep_delay arr_time sched_arr_time arr_delay carrier
## 8255 8713 0 9430 0
## flight tailnum origin dest air_time
## 0 2512 0 0 9430
## distance hour minute time_hour
## 0 0 0 0
flights2[, lapply(.SD, function(x) sum(is.na(x))), .SDcols = names(flights)]## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1: 0 0 0 8255 0 8255 8713 0
## arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1: 9430 0 0 2512 0 0 9430 0 0 0
## time_hour
## 1: 0
merge()To join two tables using data.table, use merge() function. The key variable(s) are passed via argument by if the key variable has same name, or using by.x to indicate the key variable for x table and by.y for y table. It uses argument all.x, all.y, all to decide which join to operate.
left_join(): all.x=TRUEright_join(): all.y=TRUEinner_join(): defaultfull_join(): all=TRUEdplyr:
flights %>%
group_by(carrier) %>%
summarise(num_of_flights = n()) %>%
left_join(airlines, by = "carrier") %>%
arrange(-num_of_flights)## # A tibble: 16 x 3
## carrier num_of_flights name
## <chr> <int> <chr>
## 1 UA 58665 United Air Lines Inc.
## 2 B6 54635 JetBlue Airways
## 3 EV 54173 ExpressJet Airlines Inc.
## 4 DL 48110 Delta Air Lines Inc.
## 5 AA 32729 American Airlines Inc.
## 6 MQ 26397 Envoy Air
## 7 US 20536 US Airways Inc.
## 8 9E 18460 Endeavor Air Inc.
## 9 WN 12275 Southwest Airlines Co.
## 10 VX 5162 Virgin America
## 11 FL 3260 AirTran Airways Corporation
## 12 AS 714 Alaska Airlines Inc.
## 13 F9 685 Frontier Airlines Inc.
## 14 YV 601 Mesa Airlines Inc.
## 15 HA 342 Hawaiian Airlines Inc.
## 16 OO 32 SkyWest Airlines Inc.
data.table:
num_of_flights_by_carrier <- merge(flights2[, .(num_of_flights = .N), by = "carrier"],
airlines,
by = "carrier",
all.x = T)
num_of_flights_by_carrier[order(-num_of_flights)]## carrier num_of_flights name
## 1: UA 58665 United Air Lines Inc.
## 2: B6 54635 JetBlue Airways
## 3: EV 54173 ExpressJet Airlines Inc.
## 4: DL 48110 Delta Air Lines Inc.
## 5: AA 32729 American Airlines Inc.
## 6: MQ 26397 Envoy Air
## 7: US 20536 US Airways Inc.
## 8: 9E 18460 Endeavor Air Inc.
## 9: WN 12275 Southwest Airlines Co.
## 10: VX 5162 Virgin America
## 11: FL 3260 AirTran Airways Corporation
## 12: AS 714 Alaska Airlines Inc.
## 13: F9 685 Frontier Airlines Inc.
## 14: YV 601 Mesa Airlines Inc.
## 15: HA 342 Hawaiian Airlines Inc.
## 16: OO 32 SkyWest Airlines Inc.
data.table can tack expressions one after another, forming a chain of operations similar to piping %>%, i.e., DT[ … ][ … ][ … ].
flights2[, .(max_distance = max(distance)), by = .(origin, dest)
][order(-max_distance)]dplyr or data.table?Choosing dplyr or data.table is a personal preference. Here is Stack Overflow post concerning this question. Both authors of dplyr (Hadley) and data.table (Arun) compare their packages in terms of speed, memory usage and syntax. In short, dplyr is more readable (though readability is a subjective question); while data.table performs faster than dplyr as the data size grows, and uses less memory in several functions by its nature. A benchmark comparison between dplyr, data.table and pandas of Python is also available here and a more comprehensive comparison here.
If you really want the pros of both, dtplyr would be a solution. dtplyr allows you to write dplyr code and translate to data.table.
Lastly, we introduce a way to run Rstudio in the cloud based on Rstudio Server. This can come in handy when doing more complex analysis with larger datasets, as it will greatly speed up processing times.
There are many existing guides to help you set up an RStudio instance on the cloud with Google or Amazon Web Services.
Here are a few good guides for using Google Cloud Services:
and Amazon Web Services: